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Abstract 

Estimating the size of stigmatized, hidden, or hard-to-reach populations is a major 
problem in epidemiology, demography, and public health research. Capture-recapture 
and multiplier methods have become standard tools for inference of hidden population 
sizes, but they require independent random sampling of target population members, 
which is rarely possible. Respondent-driven sampling (RDS) is a survey method for 
hidden populations that relies on social link tracing. The RDS recruitment process 
is designed to spread through the social network connecting members of the target 
population. In this paper, we show how to use network data revealed by RDS to 
estimate hidden population size. The key insight is that the recruitment chain, timing 
of recruitments, and network degrees of recruited subjects provide information about 
the number of individuals belonging to the target population who are not yet in the 
sample. We use a computationally efficient Bayesian method to integrate over the 
missing edges in the subgraph of recruited individuals. We validate the method using 
simulated data and apply the technique to estimate the number of people who inject 
drugs in St. Petersburg, Russia. 
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1 Introduction 


Estimating the size of stigmatized, hidden, or hard-to-reach populations such as homeless 
people, sex workers, men who have sex with men, or drug users is an important part of 
epidemiological, demographic, and public health research (UNAID^|20106, Bao et ah]|2010 


World Health Organization 2014, Abdul-Quader et ah 2014). Census-like enumeration of 


hidden population members is usually impossible since potential subjects may fear perse¬ 
cution if they participate in a research study. When random sampling of target population 


members is feasible, multiplier (e.g. Heimer & White 2010, Hickman et al. 2006, Quaye 


et al.||2015) and capture-recapture methods (Fienber^|1972 Laska et ah]|1988 Larson et al. 


1994, Hall et al. 2000) for estimating population size may perform well. Unfortunately 


random sampling is often impossible because there is no sampling “frame”; population 
members are not directly accessible to researchers. This difficulty has led researchers to 
develop survey techniques and corresponding statistical tools that do not require random 
sampling and instead rely on properties of social networks. 

In “snowball sampling”, subjects enumerate their social contacts, each of whom enters 


the study, and the process repeats (Goodman 1961). Since snowball sampling reveals the 
network (induced subgraph) of respondents, the sample may carry information about global 


properties of the social network connecting members of the hidden population. Frank fc 


Snijders (1994) estimate hidden population size from snowball samples by making homo¬ 


geneity assumptions about the underlying social network, and David & Snijders (2002) use 
the method to estimate the number of homeless people in Budapest. Further design-based 
approaches to population size estimation using snowball sampling have been developed 


Telix-Medina & Thompson 

2004, 

Felix-Medina & Monjardin 

2009 

Vincent & Thompson 


2012). Snowball sampling is often not feasible because social contacts of participants may 


decline to enroll in the study. When this happens, the subgraph of respondents may be 
incomplete, and estimation of population properties - especially the size of the population 
- may suffer. 

The network scale-up method is an alternative technique in which researchers survey 
members of the general population to determine how many people they know (their personal 
network size), and how many people they know who are members of the target population 
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(Killworth et al. 1998, Bernard et al. 2010). The proportion of respondents’ contacts who 
are members of the target population is assumed to be equal to the population proportion. 
Multiplying this proportion by the known general population size produces an estimate 
of the target population size. The network scale-up method has been successfully used 
to estimate the size of groups at risk of HIV infection, including men who have sex with 


men, injection drug users, and sex workers (Kadushin et al. 2006, Salganik et al. 2011 


Ezoe et al. 2012, Shokoohi et al. 2012, Guo et al. 2013). The method is appealing because 


researchers do not need access to the hidden population, but its validity relies on subjects’ 


knowledge of their contacts’ membership in the target population (Killworth et al. 1998). 


Sometimes membership in the target population is obscured from non-members (Shelley 


er^[I^[200^ , or groups within the general population may have different probabilities 


of ties to the target population (Snidero et al. 2004, Zheng et al. 2006, [McCormick et al. 
2010, Feehan & Salganik 2014| ). 


Respondent-driven sampling (RDS) is a widely used procedure for recruiting members of 
hard-to-reach populations for surveys and interventions that relies on participants to recruit 


other subjects (Heckathorn 1997, Broadhead et al. 1998). Beginning with an initial group of 
participants called “seeds”, subjects are interviewed and given a reward for participation. 
Subjects then receive a small number of “coupons” that they can use to recruit other 
eligible subjects. Each coupon is marked with a unique ID traceable back to the recruiter. 
Subjects recruit others into the study by giving them a coupon that they “redeem” by 
enrolling in the study. When a new subject enrolls and is interviewed, their recruiter 
receives a reward. In this way, the RDS recruitment process is designed to spread through 
the social network of the hidden population. One common feature of all RDS surveys is 
that researchers assess each subject’s network degree, the number of other members of the 
target population the subject knows. Because of privacy restrictions, subjects typically do 
not provide identifying information about members of their social network. Most statistical 
work on RDS has focused on estimators for population means (Salganik & Heckathorn 2004| 


Volz & Heckathorn 2008, Gile 2011 


Does RDS reveal information about the size of the target population? Just as in snow¬ 
ball sampling and the network scale-up method, subjects report how many members of the 
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target population they know. Unlike network scale-up surveys, only members of the target 
population are recruited to participate in an RDS study. In contrast to snowball sampling, 
not all social contacts of the subject are surveyed: in RDS the subjects decide which of their 


contacts to recruit. Despite these limitations, Paz-Bailey et ah (2011) use RDS to perform 
the recapture step of a capture-recapture experiment, even though recruited individuals are 

(see, e.g. 


not sampled uniformly at random from the target population Berchenko & Frost 


2011 

for commentary). Recently 

Handcock, Gile & Mar 

(2014) 

Handcock et al. (2015 


proposed a population size estimator for RDS based on ideas from without-replacement 


sampling proportional to size (Bickel et ah 1992, Gile 2011). Their successive sampling 
size (SS-size) estimator depends only on the time-ordered sequence of observed network 
degrees in the RDS sample. By assuming that RDS is a sampling mechanism that recruits 
individuals without replacement and with probability proportional to their network degree. 


Handcock, Gile & Mar (2014) and Handcock et al. (2015) reason that the average degrees of 


recruited individuals should decrease monotonically with the number of recruited subjects. 
The rate of this decrease is believed to reveal information about the size of the population 
via early depletion of high-degree individuals. The RDS Analyst software implements the 


SS-size method (Handcock, Fellows & Gile 2014). 

In this paper, we take a network-based approach to population size estimation from 
RDS, based on the intuition behind the snowball sampling estimator and the network 
scale-up method. The key insight is that the RDS recruitment chain, timing of recruit¬ 
ments, and the degrees of recruited subjects provide information about the number of 
links between sampled and unsampled population members, and hence the total popula¬ 
tion size. We hrst describe the graphical structure of data obtained from RDS, including 
the recruitment graph and recruitment-induced subgraph. The unobserved portions of 
the recruitment-induced subgraph are treated as missing data. We describe a Bayesian 
framework for marginalizing over the missing edges in the recruitment-induced subgraph 
to estimate population size. The method relies only on data traditionally obtained by RDS 
and does not require a change to current RDS recruitment protocol, nor a separate survey 
of subjects who are not members of the target population. The computational burden of 
the inference procedure scales with the sample size, not the total hidden population size. 


4 






























12345678 



Figure 1; Illustration of the observed data in RDS surveys. At right, the recruitment graph 
Gr is shown overlaid on the population graph G. Vertex 1 is a seed, and an arrow from 
i to j indicates that i recruited j. Several vertices in G remain unsampled. Researchers 
conducting an RDS study observe neither G nor the induced subgraph of the sampled 
vertices. Next the observed data are shown: the adjacency matrix of the recruitment 
graph Gn, the vector of degrees d, the vector of recruitment times t, and the coupon 
matrix C. The numbered rows and columns correspond to the sampled vertices, numbered 
in the order of their recruitment. This hgure is adapted from Crawford (2014). 


We validate the proposed technique using simulated data and apply the method to estimate 
the number of injection drug users in St. Petersburg, Russia. 


2 The graphical structure of RDS data 


In this section, we outline the observed data in typical RDS surveys of hidden populations. 


drawn from the definitions given by Crawford (2014). Suppose that the hidden population 
social network is G = (y,E), where \V\ = N is the size of the target population and G 
contains no self-loops or parallel edges. A vertex in G is recruited if it is known to the 
study. A recruited vertex cannot be recruited again. 


Definition 1 (Recruitment graph). The directed recruitment graph is Gr = {Vr^Er), 
where Vr <Z V is the set of n sampled vertices and a directed edge {i,j} G Er indicates 
that i recruited j. 


Since subjects cannot be recruited more than once, Gr is acyclic. 

Definition 2 (Degree). A vertex’s degree is the number of edges incident to it that connect 
to vertices in the hidden population graph G. 
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Definition 3 (Recruitment-induced subgraph). The recruitment-induced subgraph is an 
undirected graph Gs = (V 5 , Es), where Vs = Vr consists ofn sampled vertices, and {i,j} G 
Es if and only if i E Vs, j G Vs, and {i,j} G E. 

Let d be the time-ordered n x 1 vector of subjects’ degrees in the order they were recruited 
into the study and let t = {ti,... ,tn) be the u x 1 vector of recruitment times, where 
ti < ■■■ <tn. 

Definition 4 (Coupon matrix). Let C he the n x n coupon matrix whose element Cij is 1 
if subject i has at least one coupon just before the jth recruitment event, and zero otherwise. 
The rows and columns of C are ordered by subjects’ recruitment time. 

The observed data from the RDS recruitment process is Y = (Gij,d, t,C). Figure 
illustrates the observed data and their relationship to the unobserved population graph G. 
Since the recruitment graph Gr does not contain any edges along which a recruitment event 
did not take place, the recruitment-induced subgraph is not fully observed. However, 
the observed degrees d and the edges in the recruitment graph Gr place restrictions on the 
number of non-recruitment edges that can connect vertices in Vs, and it is intuitively clear 
that an estimate Gs oi Gs must adhere to certain compatibility conditions. 

Definition 5 (Compatibility). An estimated subgraph Gs = (Ks, Es) is compatible with the 
observed data (Gi?,d) if the following conditions are met: 1. the vertices in the estimated 
subgraph are identical to the set of recruited vertices: v E Vs if and only if v E Vr; 

2. all directed recruitment edges are represented as undirected edges: for all {i,j) E Er, 
{i,j} G Es; 3. the number of edges in Gs belonging to each sampled vertex does not exceed 
the vertex’s degree: for all v G Vr, Ill'll^ ^s} < d^, where d„ is the degree of 

vertex v. 

These compatibility conditions provide topological constraints on the structure of G^. 

3 Inference for the population size 

In this section, we construct a probability model by which the observed data Y = (G^, d, t, C) 
in an RDS survey are linked to the number of vertices N in the target population. Figure 
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Figure 2: Illustration of population size estimation task using the graphical structure of 
data obtained by RDS. We seek the number of vertices N in the graph G = {V,E). The 
observed recruitment graph Gr is shown at left. Each vertex is augmented with the number 
of pendant edges implied by its degree. RDS data do not directly reveal which of these 
pendant edges connect to observed vertices, and which connect to unobserved vertices. At 
right, the recruitment-induced subgraph Gs has been reconstructed, revealing the number 
of edges that connect to unsampled vertices at each step of the recruitment process. Section 


3.1 provides a derivation of the likelihood of N given Gs- 


illustrates the problem of estimating the number of vertices in G from the recruitment- 
induced subgraph Gs- First we show that if the recruitment-induced subgraph Gs is known, 
a simple statistic - the number of pendant edges connecting each sampled vertex to un¬ 
sampled vertices at the moment of recruitment - can be used to derive the likelihood of N 


conditional on G 5 . Next, we appeal to results by Crawford (2014) giving the likelihood of 


the recruitment-induced subgraph Gs and a per-edge recruitment rate parameter A. Our 
strategy is to marginalize over Gs and A to arrive at the posterior distribution of N. 


3.1 Likelihood of N given Gs under the Erdds-Renyi model 

We hrst state some assumptions about the social network connecting members of the hidden 
population and the RDS recruitment process on this network. 

Assumption 1 (Existence of a network). The target population social network is a finite 
graph G = (V, E) with no parallel edges or self-loops. 


Network-based methods for population inference must make homogeneity assumptions to 
ensure that a sub-sample of the network can be used to make inference about the total 
network. In the Erdds-Renyi random graph model, each edge between vertices is formed 


independently with probability p (Erdos & Renyi 1959, 1960). Let G ~ Q{N,p) denote an 
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Erdos-Renyi random graph. The degree di of a vertex i has distribution 


di ~ Binomial(A^ — l,p), 


( 1 ) 


where N = \V\. The likelihood of a particular graph G depends only on the number of 
edges \E\, 

L(iV,p|G') (2) 

The Erdos-Renyi random graph model formalizes the notion of independent and identically 
distributed (with probability p) formation of reciprocal social ties between individuals in a 
hnite population. While the Erdos-Renyi model is believed to be a poor generative model 


for non-hidden social networks (Watts & Strogatz 1998, Robins et ah 2001), very little is 


known about the structure of contacts between members of highly stigmatized or crimi¬ 
nalized populations. The Erdos-Renyi model has proven to be empirically very useful for 


estimating hidden population sizes: both the snowball sampling estimator (Frank & Sni- 


jders 1994) and the network scale-up estimator Killworth et ah (1998) rely on equivalent 


network homogeneity assumptions. The real-world usefulness of the Erdos-Renyi model for 
hidden population size estimation suggests that an approximate relationship between indi¬ 
vidual degrees and the population size N like ([^ may hold in some real-world populations. 
We now specify the distribution of the hidden population graph. 

Assumption 2 (Network model). The target population graph has Erdos-Renyi distribu¬ 
tion, G ~ Q{N,p). 

The likelihood of conditional on G^ and d under the Erdos-Renyi model depends on 
assumptions about the dynamics of the RDS recruitment process. But there is signihcant 


disagreement about how to model the recruitment process (Salganik & Heckathorn 2004 


Gile fc Handcock 2010 , Gile [2011 , [Berchenko et ah 2013 Grawford 2014 Malmros et al 


2014). We therefore make a simple assumption that permits calculation of the distribution 


of a statistic of Gs under Assumption]^ Gall a vertex a recruiter if it has at least one coupon 
and shares an edge with an unrecruited vertex. Gall a vertex susceptible to recruitment if 
it has not yet been recruited and shares an edge with a recruiter. 
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Assumption 3 (RDS sampling probabilities). The next recruited vertex is chosen from 
among all susceptible vertices with probability that depends only on the edges it shares with 
recruiters. The edges connecting the newly recruited vertex to other unrecruited vertices do 
not affect its probability of being recruited. 

Assumption provides a connection between the recruitment probability for each vertex 
and the structure of the network. 

Under Assumptions and the recruitment-induced subgraph Gs is not an Erdds- 
Renyi graph because new vertices may not be chosen uniformly at random from the set of 
unrecruited vertices. However, since recruitment probability does not depend on edges not 
connected to active recruiters, it does not depend on edges connecting unrecruited vertices 
to other unrecruited vertices in particular. This intuition yields a suitable probability model 
linking the subgraph Gs to the population size N. Suppose Gs is known and let df be the 
number of edges belonging to vertex i that connect to unknown vertices at the moment i 
is recruited (recall that the indices i are ordered by the time of entry into the study), 

2—1 

= (3) 

Then by independence of edges in the Erdos-Renyi model, 

df ~ Binomial(A — i, p) (4) 

unconditional on d* and dj for j i and independently of df for j i. In words, the 
number of edges connecting a recruited vertex to unrecruited vertices (at the moment it 
is recruited, before observing its total degree) depends only on the number of remaining 
unrecruited vertices and p. These d“ connections are formed independently with probability 
p and without replacement to any of the N — i remaining unsampled vertices. 

The presence of the population size parameter A in (|^ suggests that the sequence of 
d“’s may contain information about N. Since df, ..., d“ are independent binomial random 
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variables, the joint likelihood of N and p, given Gs and d, is 


L{N,p\ds,Y) 



P'^'i (1 — p) 


N-i-dY 


where is calculated from Gs and d via (|^. 


( 5 ) 


3.2 Likelihood of Gs 

The likelihood ([^ permits estimation of N, conditional on observation of the recruitment- 
induced subgraph Gs- However, Gs is not directly revealed by the observed data Y. 
The statistic d“ = (d“,..., djj) is sufficient for N and p, but the graphical structure of 
Gs induces complex combinatorial dependencies in the elements of d“, and the marginal 
probability distribution of d“ cannot be represented in a simple way. We therefore seek 
a probability model for Gs given Y, and marginalize over the unobserved portion of this 
graph with respect to this model. The compatibility conditions given in Dehnition place 
strong restrictions on the structure and density of Gs- Let C(Gij,d) denote the set of all 
recruitment-induced subgraphs that are compatible with the observed data Gr and d. 

The least restrictive option is to marginalize over Gs with respect to the uniform distri¬ 
bution on C{Gr, d) by setting 7i{Gs) oc 1 for Gs G C{Gr, d) and zero otherwise. However, 
the uniform distribution over C{GR,d) does not give rise to the uniform distribution over 
IT's!, and most subgraphs in C(Gij,d) have far more edges than the true subgraph Gs- 
The result is that the uniform distribution over subgraphs Gs results in a highly informa¬ 
tive distribution over d“ that does not place most of its mass near the true value of d“. 
Alternatively, we could place a prior distribution over the number of edges in Gs- To 
illustrate, let n^Gs) oc exp[—for Gs € C{Gr, d) and zero otherwise. Choosing 7 > 0 
penalizes dense subgraphs, and all subgraphs with a given number of edges have the same 
probability under this model. 

A more sophisticated marginalizing distribution can be derived from the time series of 
recruitment events. By making assumptions about the time dynamics of the recruitment 
process on Gs, we can calculate the likelihood of the observed recruitment times t condi¬ 
tional on Gs to develop a probability model for Gs- The recruitment model depends on 
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the following assumptions, drawn directly from Crawford (2014). 


Assumption 4. Vertices become recruiters immediately upon entering the study and re¬ 
ceiving one or more coupons. They remain recruiters until their coupons or susceptible 
neighbors are depleted, whichever happens first. 


Call an edge in G susceptible if it links a recruiter and a susceptible vertex. 


Assumption 5. When a susceptible neighbor j of a recruiter i is recruited by any recruiter, 
the edge connecting i and j is immediately no longer susceptible. 

Assumption 6 (Exponential waiting times). The time to recruitment along an edge con¬ 
necting a recruiter to a susceptible neighbor has exponential distribution with rate X, inde¬ 
pendent of the identity of the recruiter, neighbor, and all other waiting times. 

Assumptions |4]j^ are consistent with Assumption (for proof, see Propositions 1 and 2 of 


Crawford 2014). 


The likelihood of the recruitment time series on a hxed graph can be computed under 
this model. Let w = (0, ti—0, t 2 —ti, ..., tn—tn-i) be the vector of inter-recruitment waiting 
times. Let A be the adjacency matrix of Gs, where the rows and columns of A correspond 
to vertices in the order of their recruitment into the study. Let u be the n x 1 vector 
whose ith element is the number of pendant edges emanating from i to unsampled vertices, 
Uj = di - Ap. Then the joint likelihood of Gs and the waiting time parameter A is 
given by 


where 


L{Gs,X\Y) = j Xsj j exp[-As'w], 


s = lowerTri(AC)'l -h C'u 


( 6 ) 


(7) 


and M is the set of seeds (Crawford 2014). Information from the subgraph Gs enters 


the likelihood through the vector s, the number of susceptible vertices just before each 
recruitment event. 
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3.3 Posterior distribution of N 


We now combine the likelihood expressions ([^ and ([^ with prior information to formulate 
the posterior distribution of N. The joint likelihood is L{N,p, Gs, A|Y) = L{N,p\Gs, Y) x 
A|Y). Assume N, p, Gs, and A are a priori independent with prior distributions 
7r(A^), 7r(p), ■n{Gs), and 7 r(A) respectively. The posterior distribution of N is obtained by 
marginalizing over compatible subgraphs, p, and A, 

poo pi 

Pr(iV|Y) oc 'k{N)Y,<Gs) / T(G 5 ,A|Y) 7r(A) / L(iV,p|Gs,Y) 7r(p) dp dA. (8) 


where the sum is over compatible subgraphs Gg G C(G/j,d). Let p have Beta(a,/3) dis¬ 
tribution with density 7r(p) oc — pY~^. Let A have Gamma(p,.^) distribution with 

density vr(A) oc Then integrating analytically over p and A in (|^, the posterior 

distribution of N becomes 


Pr(Ar|Y) oc 7r(Y) ^ 
Gs 


(s'w 


n 


n ("r) 


B {D^ + a,nN - + 13) (9) 


where B(-, ■) is the Beta function and D'^ = ® computed using Gg- A 

derivation of (|^ is given the Supplementary Materials. 


3.3.1 Prior distributions for N 

Not every value of N is feasible: since no parallel edges are allowed under Assumption 
N must be large enough to accommodate all the edges emanating from sampled vertices. 
Therefore, we need A^ > u -|- maxj df for the (i“’s derived from a particular subgraph Gg. 
Rather than make the prior 7r(A^) conditional on each particular realization of Gg, we note 
that < di for every compatible Gg and impose the simpler constraint N > n + maxj di, 
which does not depend on any particular Gg. For surveys where Y S> n, this should not 
pose a problem for estimation of N. Setting Y^in = n + maxjd,, we will always consider 
(|^ to be defined only for Y > Ymin- 

A relatively uninformative class of prior distributions for Y is the power-law mass 
function 7 r(Y) oc Y“'^ where c > 0 and Y > Ymm- When c > 1 the prior density is 
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proper: YlN=Nmin ^(-^) < When c > 2 the prior mean exists, and when c > 3 the 
prior variance exists. However, researchers may prefer not to specify a strongly informative 


prior for N, and c = 1 is a popular choice (Draper & Guttman 1971, Raftery 1988). 
Unfortunately the posterior distribution Pr(A^|Y) may not behave well for some values of 
Kahn ( 1987[ ) rns that estimates based on the beta-binomial distribution can have 


c: 


undesirable properties under some priors 7r{N). In the Supplementary Materials, we show 
that the posterior mass function ([^ is a proper probability distribution when a -|- c > 1; 
when a -f- c > 2 the posterior mean is hnite, and when a + c > 3 the posterior variance 
is hnite. When posterior moments of interest do not exist, it may be tempting to posit 
K^max, the largest permissible estimate of N, and letting 7r(iV) oc < N < iVmax}- 

But since the posterior moments for unbounded N are undehned, their estimates under 
the truncated prior depend acutely on the choice of iVmax and are less inhuenced by the 
observed data ( |Kahn 1987). We therefore consider below specihcations of 7r(iV) such that 
the prior has inhnite support, the posterior is proper, and at least the hrst two moments 
exist. While this inevitably results in a more informative set of priors, it seems a small 
price to pay for hnite posterior mean and variance. 


3.4 Monte Carlo sampling 

The posterior distribution (|^ is obtained by marginalizing over compatible subgraphs 
Gs- Under the compatibility conditions in Dehnition this sum cannot be performed 
analytically and the distribution of N conditional on Gs does not have a standard form. 
We therefore resort to Gibbs sampling: hrst we sample Gs conditional on N, then sample 
N conditional on Gs- Sampling Gs is remarkably efficient because update expressions 
are available for the statistic s in the likelihood ([^, making the matrix multiplications in 
([^ unnecessary. Integration over compatible subgraphs Gs is accomplished by proposing 
changes to the connectivity of Gs, then using a Metropolis-Hastings step to accept or reject 
the proposal. Sampling N given Gs relies on a close approximation to the conditional 
distribution. The Supplementary Materials provide a comprehensive description of the 
Gibbs sampling routine. 
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4 Validation using simulated data 


We performed simulations to validate the proposed method for population size estimation 
from RDS data under the model outlined in Section First, we simulate an Erdds- 
Renyi population network G = {V,E) with \V\ = N = 1000, 5000, and 10000, p = 5/N, 
10/N, and 15/iV. Conditional on the simulated population graph, we simulate the RDS 
recruitment process under typical real-world study conditions with n = 500 recruitments 
starting from \M\ = 10 seeds, and three coupons per recruit using the model described by 


Crawford (2014). From the simulated recruitment data, we extract Y = (Gi{,d,t,C) and 


estimate the posterior distribution of N given Y outlined above. 

We employ weakly informative priors for the unknown parameters. We assign to N 
the vague improper prior distribution 7i{N) oc N~^. For the edge density p we assign 
p ~ Beta(Q;,/3), with a > 2 and {3 = a(l — Ptrue)/Ptrue, where ptme is the true value of p. 
This specihcation ensures that the posterior distribution of N has hnite second moment and 
the prior expectation of p is equal to Ptrue- To evaluate the sensitivity of posterior estimates 
to variation in the prior parameters, we set a = 3, 10, and 20; we set the prior variance 
for A to ua = 1, since simulation results appear to be insensitive to the prior variance for 
A. The prior for Gs is n{Gs) oc exp[- 7 |F;s|] where 7 = - log (ptrue/(l - Ptme))- For the 
waiting time parameter A, we specify p and ^ to give prior mean equal to the true value Atrue 
and prior variance vx. Then we let p = and ^ = Xtme/vx, which gives E[A] = Atrue 

and Var[A] = vx- The true value is Atrue = 1 for all simulations. 

For each parameter combination, we simulated 100 independent networks and RDS 
datasets, and for each dataset, we estimate N via its posterior mean. Table shows 
posterior summaries for the simulated data. For each set of 100 simulations, the true N, the 
expected degree Np, and a are shown. We report the mean of all 100 posterior means, the 
standard deviation (SD) of the posterior means, and the relative bias (E[iV|Y]—iVtrue)/-^true- 
Posterior means of the population size N show low, mostly positive bias. The relative bias 
does not seem to increases slowly with N. The standard deviation (SD) of posterior means 
increases with higher N. Estimates of N exhibit least bias when a is large, indicating 
greater certainty about the edge density p. With a = 3, the posterior mean exists, but the 
posterior variance does not. This setting explores estimation under the weakest possible 


14 





prior assumptions about p that nevertheless guarantee that the posterior mean exists. 
While it seems from Table that some bias is present when a = 3, it is encouraging that 
relatively weak prior assumptions can still give rise to reasonable estimates. 


5 Application: how many people inject drugs in St. Pe¬ 
tersburg? 

The Russian Federation has experienced simultaneous epidemics of drug abuse and HIV 
infection since the mid-1990s, and HIV prevalence is highest among people who inject 
drugs (PWID) (Abdala et ah 2003, Rhodes et ah 2004, World Health Organization 2005| 
Pokrovsky et al.|2010 UNAIDS]2010a ). Drug possession in Russia can result in serious legal 
penalties, including incarceration, loss of employment, and revocation of driving privileges. 
HIV-positive people in Russia are often subject to strong social stigma and may lack access 
to treatment and education resources (Balabanova et ah |2006| Sarang et al. 2012, Burke 
et al.|20i5). In St. Petersburg, Russia, HIV incidence and prevalence are high among PWID 


(Kozlov et al. 2006, Niccolai et al. 2011), and researchers have found that many PWID do 


not have ready access to HIV testing and are not aware of their HIV status (Niccolai 


et al. 2010). PWID in Russia often obtain drugs through local social networks connecting 


drug dealers and buyers (Shaboltas et al. 2006, Cepeda et al. 2011). The social nature of 


the drug scene in St. Petersburg creates problems for public health and epidemiological 
research on PWID (also called injection drug users - IDUs): “Such a structure makes it 
difficult to recruit through outreach and easier to recruit by allowing IDUs to penetrate 


their own network of contacts” (Shaboltas et al. 2006, page 662). PWID in St. Petersburg 


therefore constitute an epidemiologically important hidden population, connected by a 
social network, for which random sampling is impossible. 

Knowledge of the size of the PWID population in St. Petersburg would substantially 
illuminate the number of people at risk for HIV infection, and could help determine the 
scale and scope of education, treatment, and intervention programs in that community. 
To estimate the number N of PWID in St. Petersburg, Heimer fc Whit^ (2010) use a 
multiplier method with estimated HIV prevalence (from a different RDS study), HIV testing 
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Simulation 



Posterior 

Parameters 

Prior 

Estimates of N 

N Np 

a 

Mean 

SD 

rel bias 

1000 5 

3 

1010 

104 

0.010 


10 

1011 

97 

0.011 


20 

1010 

90 

0.010 

10 

3 

964 

67 

-0.036 


10 

953 

62 

-0.047 


20 

967 

63 

-0.033 

15 

3 

952 

54 

-0.048 


10 

950 

52 

-0.050 


20 

953 

51 

-0.047 

5000 5 

3 

5953 

3208 

0.191 


10 

5416 

1664 

0.083 


20 

5126 

1091 

0.025 

10 

3 

7072 

4071 

0.414 


10 

5327 

1495 

0.065 


20 

5093 

1038 

0.019 

15 

3 

6134 

2742 

0.227 


10 

5434 

1421 

0.087 


20 

5109 

999 

0.022 

10000 5 

3 

14240 

8188 

0.424 


10 

10829 

3536 

0.083 


20 

10404 

2356 

0.040 

10 

3 

14931 

9114 

0.493 


10 

10932 

3426 

0.093 


20 

10519 

2357 

0.052 

15 

3 

13105 

7010 

0.311 


10 

10767 

3372 

0.077 


20 

10415 

2302 

0.042 


Table 1: Simulation results with RDS sample size n = 500, \M\ = 10 seeds, A = 1, and 
three coupons per subject. The true value iV, the expected degree iVp, and prior parameter 
a are shown for each set of 100 simulations. We report the average posterior mean, average 
posterior SD, and relative bias. Posterior means of N exhibit low (but mostly positive) 
bias, with higher SD, as the true population size increases. Estimates are most accurate 
when prior information about p is strong, and a is large. 


16 



frequency, and other sources of information to obtain N = 83118 ±5799. Given that nearly 
all epidemiological research on PWID in St. Petersburg uses RDS to recruit participants, a 
method for estimating population size directly from RDS data would be particularly useful. 

We analyze data from an RDS study of PWID in St. Petersburg performed during 
2012-2013. Researchers recruited n = 813 PWID using 17 seeds and conducted interviews 
to gauge perceived barriers to use of HIV prevention and treatment services. While the 
study was not intended to be used for population size estimation, its size and adherence 
to the traditional RDS recruitment protocol outlined by Heckathorn (|1997) make it an 


appealing opportunity for population size estimation. Crawford (2014) shows the observed 
data Y = {Gr, d, t, C) from this study and describes the recruitment procedure in detail. 

We investigate estimation of N under the vague prior tt{N) oc iV“^, A ~ Gamma (?7 = 
1,.^ = 1), and several specihcations for 7 r(p), indexed by the parameters a and (3. To hnd 
a suitable prior for p that takes into account both the previous population size estimate of 


Heimer & White (2010) and the requirement that the hrst two moments of the posterior 


distribution exist, we adopt an empirical Bayes approach. In the Supplementary Materials, 
we describe a method for prior elicitation using a lower bound for p, given a prior estimate 
N of N. For the St. Petersburg data, we hnd that this bound is pio = 1.26 x 10“®. We 
hx different values of a > 3 and choose /d > 0 such that Pr(p > pio|a,/?) = 0.99 under 
the Beta distribution for p. We consider a = 3.1,4,5, 6 , 7, and 8 . As before, we set 
^{Gs) <x. exp[—ylidsi] where 7 = log(p/(1 — p)), with p = a/{a + (3). 

Table shows posterior summaries for the estimated number N of PWID in St. Peters¬ 
burg under each prior specihcation. The posterior mode, mean, standard deviation (SD), 


and 95% posterior quantiles are shown. Heimer & White (2010) state that at least 30,000 


cases of HIV in PWID have been reported; the 2.5% quantile for a = 3.1 is just below 
this number. Under this prior specihcation, increasing values of a decrease the prior mean 
of p, giving larger posterior estimates and variances of N. The posterior mean E[Y|Y] 
is more sensitive than the mode to changes in a because it is strongly ahected by the 
thickness of the right-hand tail of the posterior distribution. We obtain posterior mode 


estimates between 53,000 and 210,000, which are generally compatible with that of Heimer 


& White (2010): posterior quantile intervals corresponding to a = 3.1,4, 5, and 6 contain 
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Prior 


Population size N 


Prevalence (%) 

a 

Mode 

Mean 

SD 

2.5% 

97.5% 

20-45yrs 

All 

3.1 

53797 

89332 

65919 

28968 

237986 

5.9 

1.8 

4.0 

77187 

101692 

48046 

41394 

216854 

6.8 

2.1 

5.0 

100307 

141534 

65067 

58954 

309433 

9.4 

2.9 

6.0 

125086 

167100 

69161 

78403 

345233 

11.1 

3.4 

7.0 

152588 

209075 

91636 

95078 

442319 

13.9 

4.3 

8.0 

170167 

195711 

64268 

98111 

351162 

13.1 

4.0 


Table 2: Estimates of the number of people who inject drugs in St. Petersburg, Russia from 
an RDS dataset of n = 813 subjects. The prior for p depends on r, dehned in the text. 
Posterior means, standard deviations, and 2.5% and 97.5% quantiles are shown. The last 
two columns show the approximate implied prevalence (%) of injection drug use in 20-45 
year-olds and for all residents of St. Petersburg. 


their estimate N = 83,118. Setting a = 8 results in the highest estimates of over 200,000; 
estimates substantially larger than this may not be credible. The total number of people 
in St. Petersburg is approximately 4.9 million, and Heimer & White ( 2010| estimate the 
number who match the age range (20-45 years) characteristic of PWID as approximately 
1.5 million. The last two columns give the implied prevalence of injection drug use in both 
of these groups, computed using the posterior mean. Posterior expectations and quantiles 
of N in Table are sensitive to the prior mean of p. The conditions required for the pos¬ 


terior distribution of N to have hnite variance necessitate informative priors for p (Kahn 


1987). Nevertheless, the estimates of the number N of PWID in St. Petersburg are in 


general agreement with those of Heimer & White (2010) and span a range of reasonable 
values. 


We also analyze the St. Petersburg data using the SS-size method described by Hand- 


cock, Gile & Mar (2014) and Handcock et ah (2015). Results are shown in Table 1 of the 


Supplementary Materials. The SS-size model and the method proposed in this paper are 
quite different, but we have attempted to impose similar prior specihcations so that the re¬ 
sults are comparable between the two approaches. The posterior estimates from the SS-size 
method generally fall between 1000 and 4000 when the raw degrees d are used, which is not 
within the feasible range for the number of PWID in St. Petersburg. Estimates increase 
to between 20,000 and 100,000 when subjects’ reported degrees are “imputed” by the SS- 
size software. Estimates under the SS-size model are highly sensitive to a user-specihed 
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maximum N value. Setting this maximum to 500, 000 results in the largest estimates. The 
prior distribution imposed on N does not seem to greatly affect the posterior estimates in 
the SS-size method. Estimates from the SS-size model using the raw degrees d imply that 
the prevalence of injection drug use is between 0.09% and 0.18% for 20-45 year-olds and 
between 0.03% and 0.06% for all residents of St. Petersburg, which is far lower than the 
known minimum prevalence based on the number of registered PWID, and the number of 
PWID known to be HIV-positive. 


Gile (2011), Handcock, Gile & Mar (2014), Handcock et al. (2015), and Gile et ah (2015) 


argue that degrees of subjects recruited by RDS should decrease as the sample accrues. 
One possible reason for the poor performance of the SS-size method in this dataset is that 
the time-ordered degrees do not adhere to this assumption. The mean reported degree 
in the St. Petersburg dataset is 10.26 with SD 8.5; the maximum reported degree is 200. 
Figure 1 of the Supplementary Materials shows the reported degrees and a linear regression 
line overlaid. To test whether the time-ordered sample of subjects’ degrees decreases, we 


use the approach suggested by Gile et al. (2015) and regress the integers 1,... ,?t, on the 
observed degrees d, ordered by the time of recruitment. We employ linear, Poisson, and 
M-estimation with Huber and bisquare weighting. We fit these regression models using the 
full dataset of n = 813 reported degrees and with the same dataset excluding one outlier 
subject who reported d = 200. The results are shown in Table 2 of the Supplementary 
Materials. The estimated slope coefficient is always small and positive. There does not 
appear to be a significant negative trend in the reported degrees, and we conclude that 
average reported degrees do not decrease in this dataset. 


6 Discussion: models and assumptions 

We have presented a method for estimating the size of a hidden population from data 
collected during RDS surveys. The proposed estimation method recovers the true value 
of N accurately in simulations, and gives reasonable results in the application to estimate 
the number of PWID in St. Petersburg. The modeling approach relies on several assump¬ 
tions about the social network connecting members of the target population and the RDS 
recruitment process. In this section, we examine the basic assumptions underlying the 
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method, and compare them to those made by Handcock, Gile fc Mar|(2014) and Handcock 


et ah (2015) in deriving and justifying the SS-size estimator. 


6.1 The network 

In this paper, we have assumed that there exists an undirected social network G = (V, E) 
connecting members of the hidden population, and Assumption states that this network 
follows the Erdos-Renyi distribution. Human social networks are not usually well charac¬ 


terized by the Erdds-Renyi model (Watts & Strogatz 1998, Robins et ah 2001). However, 


the Erdos-Renyi model has appealing properties in the context of hidden population size 
estimation: first, the likelihood ([^ is simple and does not require calculation of a nor¬ 
malizing constant. Second, the Erdos-Renyi model reflects our general ignorance about 
the social structure of hidden populations; setting p = 0.5 gives the “uniform” distribu¬ 
tion on graphs. Third, because even small subgraphs can provide information about N in 
the Erdds-Renyi model, (|^ does not require that the network be connected, nor that the 
sample take place in the giant component. Finally, and most importantly, the Erdds-Renyi 
model has proven to be empirically useful in a wide variety of population size estimation ap¬ 


plications via the snowball sampling estimator (Frank & Snijders 1994, David & Snijders 


2002) and the network scale-up method (e.g Bernard et ah j2001 ). The success of these 
methods in real-world applications suggests that there may be some merit to the notion 
that certain kinds of acquaintanceships form somewhat independently and with common 
probability. Moreover, the proposed method uses data from RDS surveys of hidden pop¬ 
ulation members, whose within-group edge probabilities may be more homogeneous than 
between-group probabilities in the general population. 


In contrast, the SS-size model of Handcock, Gile & Mar (2014), Handcock et ah (2015) 


does not assume the existence of a network, and assigns degrees of unsampled vertices in¬ 
dependently from a pre-specified parametric distribution. This approach is unburdened by 
graph-theoretic constraints on the population network, since the set of population degrees 


drawn in this way need not correspond to the degree sequence of any graph (see e.g. Erdos 


fc Gallai||1960 ). More importantly, inference under the SS-size model is not constrained by 
topological conditions imposed by the observed recruitment graph Gr and the degrees in 
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the subgraph of respondents, as in Dehnition In the SS-size method, network topology 
local to recruited vertices does not play a role in recruitment of the sample. This lack 
of graphical constraints in the SS-size model suggests a view of RDS recruitment that is 
not network-based: subjects’ reported degrees might be regarded as surrogate measures 


of “visibility” in the population, and Handcock, Gile & Mar (2014) and Handcock et ah 


(2015) takes sampling probability proportional to visibility. 


6.2 The recruitment process 

Assumption states that the probability that a susceptible vertex is recruited depends 
only on its edges connecting to active recruiters, and does not depend on edges connect¬ 
ing to unsampled vertices. In contrast, the SS-size method largely avoids modeling the 
recruitment process by assuming that sampling of subjects in RDS occurs with probability 
proportional to their total degree, without replacement. This assumption has two impor¬ 
tant implications that highlight the difference between the SS-size model and the method 
developed in this paper. First, the SS-size model of recruitment is not compatible with 
Assumption!^ which states that the probability that a given vertex is recruited depends on 
the edges it shares with recruiters, and does not depend on edges that connect this vertex 
to other unrecruited vertices. Indeed, under the SS-size model, network topology implied 
by the recruitment graph Gr is irrelevant to the recruitment process and vertex degrees 
are treated as “sizes” in a “probability proportional to size without replacement” sampling 


scheme (e.g. Bickel et ah 1992). Second, the degrees of recruited subjects should decrease 
over time as the sample accrues under the SS-size model. We did not observe such a de¬ 
crease in mean degree in the St. Petersburg data (see the Supplementary Materials). Nor 


did Gile et al. (2015, Supplementary Materials), who hnd that in robust regression analyses 
of twelve separate RDS datasets, “[sjurprisingly, we hnd little evidence of decreasing degree 
over time”. 

However, there is reason to believe that network topology matters in determining who 
can be recruited, that RDS sampling probability is not proportional to degree, and that 


degrees need not decrease during an RDS study. Grawford (2014) argues that if RDS 


recruitments happen over edges of a population network, sampling probability has little 
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to do with total degree. Instead, the number of edges each potential subject (susceptible 
vertex) shares with recruiters determines their probability of being sampled in the next 
recruitment. Indeed, a potential subject who shares no edges with recruiters cannot be 
recruited, regardless of their degree. Worse, sample sizes for RDS studies are usually set in 
advance, so a potential subject whose location in G is more than n edges from a seed can 
never be recruited, regardless of their degree. When average degree does not decrease over 
the time-ordered sample, the assumptions underlying the SS-size method may not be met, 
and the likelihood of the ordered degrees under the SS-size process may not be informative 
for N. 

We also assume that per-susceptible-edge waiting time to recruitment is memoryless 
(Assumption |^, which provides a convenient marginalizing distribution over subgraphs 
Gs in ([^. To justify this assumption, we draw an analogy between the RDS recruitment 
process and the spread of an infectious disease on a population network. The contact 
process between “susceptible” vertices and “infective” recruiters closely parallels models 
that have gained wide use in epidemiology. The main difference is that recruiters can 
deplete their coupons in RDS, which renders them unable to recruit others. The incentive 
for recruiting other participants in RDS may also provide some justihcation for exponential 
waiting times: the need for money may be essentially memoryless. 

7 Conclusion: RDS for population size estimation? 

RDS was not designed for population size estimation, and it should not be used if other 
options like census enumeration or capture-recapture are available and the assumptions 
necessary for their use are justihed. But RDS remains a popular survey method for good 
reason: it is a remarkably effective procedure for recruiting subjects who might otherwise be 
reluctant to participate in a research survey. The lack of better methods for learning about 
hidden populations suggests to us that RDS will hnd continued use by epidemiologists 
and public health researchers in the future. We have shown in this paper that by making 
some assumptions about the network and the nature of the RDS recruitment process, 
the observed data from an RDS study can provide useful information about the target 
population size. The assumptions underlying this method may be justihed when researchers 
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believe that the population network exists, and subjects are recruited across its edges. 


Supplementary Materials 
8 Posterior distribution of N 

Let n{p) oc and 7r(A) oc be prior distributions. We hnd the posterior 

distribution of N by marginalizing over subgraphs Gs € C{Gr, d), N, and p, 

_ poo p 1 

Pr(iV|Y) oc 7r(iV)5^7r(Gs) / L(G 5 ,A|Y) 7r(A) / L{N,p\Gs,Y) 7r{p) dp dA. (10) 


The integral over A is 


L(Gs,A|Y) 7r(A) dA oc 


J 


)[-As'w] dA 


rijyAr 


(s'w + 0 


n—m-\-ri 


and the integral over p is 


L{N,p\Gs,Y) 7r{p) dp oc 


n -p)^ p" ^(1 -p)^ ^ dp 


N-i 


B (D'^ + a,nN- -D^ + f3 
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where the (i“’s are computed from Gs and d, B(-, •) is the Beta function. 

The marginal posterior distribution of N is therefore 


Pr(Ar|Y) oc 7r(iV)^ 
Gs 


'^{Gs) Uji^M 

(s'w + 



B {D'^ + a^nN- -£)“ + /?). 

(13) 


9 Conditions for existence of moments of Pr(A^|Y 


The posterior mass function of N is given by (13). We seek sufficient conditions for the 


posterior mass function to be proper and to have finite hrst and second moments when 
7r{N) oc N~^. First, note that the sum over Gs € C(Gij,d) is hnite, so we ignore the sum 
over Gs and consider the function 


Pr(Ar|Y) oc 




{N 


<)! 


P(nAr - (”+i) + a +/3) 


(14) 


where we have used the dehnition of the Beta function as a ratio of Gamma functions. We 
hrst provide a bound for the product term, then the ratio of Gamma functions. Each term 
in the product obeys the bound 


< 


_ ^'jN-i+l/2^-{N-i)+l 






< 


N 


\N-n-df' 


N-i+l/2 


{N-n- dr^^y^ 


(via Stirling’s approximation) where df^^^ = maxjdj. Then 


n 


(N 


t 




dty. 


< const X 


N 


N — n — d'° 


nN-{"+^)+n/2 


(15) 


{N-n- d““)^ . (16) 
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where the (i“’s are computed from Gs and d and Second, 


r{nN - (”+^) -D^ + /3) ^ {nN - + /3 - i)^^-("2')-^“+/3-i/2g-(niv-("+i)-D“+/3-i)+i 


r(niV-(”+')+a +/3) 


V^inN - ("+') + a + /3 - l)niV-("+^)+a+/3-l/2g-(niV-("+l)+a+/3-l) 

'nN - (”+^) -L)“ + /3- 
nN - (”+^) +a + f3-l ) 

^ {nN - (’"+^) -D^ + /3- 1)-^“ e^“+“+i 
^ {nN - \^f) + a + /3 - 1)“ 


< (niV- ^1 +/?-! 


-D^-a ^£) i *+ q,+i 




(17) 


Combining (16) and ([TT)), we have 


Pr(A^|Y) < const x 


N 


N -n- 


nAf-("+l)+n/2 


{N -n- 


X (niV- ('’" + ^1 +/?-! 


-D'^-a 


N- 


= const X 


N 


N -n- 


nN-{-f)+n/2 / jy _ ^ 


D“ 


riY - (-+1) +P-1 


x(niV-(”^^)+/?-l) N- 


(18) 


The first term converges to one, the second to a constant that does not depend on N, while 
the last two terms dominate in the right-hand tail, and for large N we have 


Pr(iV|Y) ^ l^nN - 
cx 


n -|- 1 


+ 


N- 


(19) 


It follows that a sufficient condition for the posterior distribution to be proper is a -f- c > 1. 
The condition a -|- c > 2 ensures that the posterior mean exists, and a -|- c > 3 ensures that 
the second moment exists, and hence the posterior variance. 
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10 Gibbs sampling for Gs and N 


10.1 Sampling Gs given N 

describes a procedure for drawing a proposal subgraph Gs uniformly from 
the set of compatible subgraphs d). Let m = \M\ be the number of seeds. The 

conditional posterior distribution of Gs is 


Crawford 


(2014 


Pr(G5|iV,Y) oc 


n 




[s'w + 0 


n—m+77 


n ("r) 


2 = 1 


B (B“ + a,nN- (“f) - £>" + , 3 )ir(Gs) 

(20) 

where s, d“, and D'^ are computed using Gs G C{Gr, d). 

Suppose Gs = (V5, Es) is the current estimate of the recruitment-induced subgraph. 
We propose a new subgraph by adding or removing an edge from this graph. To draw a new 
sample from C{Gr, d), we select vertices i and j, with i 7^ j at random. Then if {i,j} ^ Es, 
Uj > 0, and Uj > 0, we propose to add the edge {i,j} to Es- If {i,i} G Es and {i,i} ^ Er, 
we propose to remove the edge {i,i} from Es- Otherwise, we select a different {i,i} and 
try again. The vector of the number of susceptible vertices just before each recruitment is 
s = lowerTri(AC)'l -|- C'u using the current subgraph estimate Gs and let s+ and s“ be 
the corresponding vectors obtained by adding or removing an edge between i and j. It is 


not necessary to compute s via matrix multiplication. Instead, Crawford (2014) provides 
the update expressions 


Sfc l{/c > j}Gik Gjk 

(21) 

> j}Gik + Gjk, 

for k = 1, ... ,n. Now let t* be the time at which vertex i used all its coupons or the end 
of the study, whichever came hrst. Then the change in total edge-time is given by 

= s'w — {t* — min(tj, t*) + t* — tj) 

^ J ^22) 

s“ w = s'w -h {t* — t*) +t* — tj). 

Using these expressions, the ratio of posterior probabilities for N reduces to a simple form. 
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To illustrate, suppose we wish to add the edge to Gs = (ys,Es), where {i,j} ^ Es, 

Uj > 1, and Uj > 1. For a proposal = (V5, E^) identical to Gs except that {i,j} G E^, 

= Uj — 1, and = Uj — 1, the ratio is 

Pr(G+|7V,Y) ^ fn h'l / s'w + e d] nN - ('•+‘) - g“ + /3 x(G+) 

Pr(Gs|iV,Y) s, j + N-j-d] + l D“-l + a t(Gs)' 

(23) 

To illustrate the ratio for removing the edge i,j, suppose Gs = (Vs,Es) has {hj} G Es 
and {f, j} ^ Eji. For a proposal G^ = (Vs, Eg) identical to Gs except that {i,i} ^ Eg, 
u“ = Uj + 1, and u“ = Uj + 1, the ratio is 

Pr(Gs|7V,Y) / s'w + e V C“ + a ^(0^) 

Pr(Gs|]V,Y) Sj I \S"''w + 5/ (tf + l niV-(”+') -D" - 1+ ^ ir(Gs)’ 

(24) 

Suppose G*g is the proposal graph and let Pr(G'g|G 5 ) be the probability of proposing G*g 
from Gs, with N hxed. To decide whether to accept G*g, we form the Metropolis-Hastings 
acceptance probability, 


p = min 


Pr(G-|iV,Y) Pr(G5|G-g) ] 

’ FT{Gs\N,Y)FT{G*g\Gs)j' 


The form of Pr(G 5 |G 5 ) is given by Crawford (2014). 
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10.2 Sampling N given Gs 

The posterior distribution of N conditional on a given compatible subgraph Gg is 


Pr(iV|G5,Y)cx 



B{D^ + a,nN- 


n + 1 


T)“ + /3 7r(Y) 


(26) 


Although this conditional distribution does not have a standard form, we can derive a close 
approximation using the negative binomial distribution when Pr(Y|G 5 ,Y) has a mode. 
Let d^,... ,d^he the number of pendant edges emanating from each sampled vertex at the 
moment they are recruited, calculated from Gs. Suppose for now that N is continuous- 


27 


















valued. We can calculate analytic derivatives of ^(iV) = log Pr(iV|G 5 , Y) as follows: 


di 


dH 


^^(iV - i + 1) - - i - < + 1) 

n + 1 


2=1 


+ 


il) nN 






oi -\- (i 


c 

^~N 




2=1 

+ 


_ (" + d _ 0“ + /sW v.<‘' (niV - ^ 


+ Q; + /3 ) I + ^7777 


C 

iW 


(27) 


where 'ip{x) = is the digamma function and ^ is the polygamma 

function. Let N = a,Tgma,x^i{N) be the mode of Pr(Y|Gs', Y) and let 


V = 


OH 



(28) 


be an approximation to the variance. To draw from Pr(Y|G 5 ,Y) we employ a proposal 
distribution to generate a candidate N* and use a Metropolis-Hastings correction to draw 
from the relevant conditional posterior. We will use N and v to construct a proposal 
distribution for N given Gs- Consider N* ~ NegBin(Y,r), where we have parameterized 
the negative binomial distribution by its mean and size r. The variance of the proposal 
distribution under this parameterization is Y + N‘^/r, so to achieve a proposal variance of 
V, where v > N, set r = N‘^/{v — N). The proposal distribution is 


Pr(Y* = k\N) 


N y r(r + <:) / / N V rCr+j) 

’■ + Y + 


(29) 


where we have normalized by the probability that N* > N m\„ . Then the Metropolis- 
Hastings ratio for the proposal N* conditional on Gs is 


f Ft{N*\Gs,Y) Pr(Y|iV) ] 

I ’ Pr(Y|Gs,Y) Pr(Y*|iV) j ■ 


p = mm 


(30) 



The infinite sum in the denominator of (29) cancels in the ratio (30). 


11 An approximation for prior elicitation 

Suppose we wish to find values of a and that place the prior mean of N approximately 
equal to N, a prior estimate of N. Recall that follows the Beta-Binomial distribution, 
and let p = a/{a + (3). Then E[d^] = {N — i)p and 


E 


i=l 


= p ( nN — 


n + 1 


(31) 


Equating observed and expected values of df and rearranging, we have an estimator for N 
given p. 


N = 


n + 1 




pn ^ 

i=l 


or an estimator for p given N, 


p = 


E".i < 

nN - (”«) 


(32) 


(33) 


Now let iV = in (33). Since Gs is not directly observed in an RDS study, the (i“’s are 


not available. However, we can place a sharp lower bound on the numerator of (33) by 


conditioning on the observed degrees. Let r* be the number of subjects recruited by subject 
i over the course of the study. The number of edges belonging to vertex i connecting to 
unrecruited vertices at the time of its recruitment cannot be smaller than r^. But at most 
i — 1 edges of i can connect to already-recruited vertices, so max{ri, dj — (z — 1 )} is a lower 
bound for df. Recall that M is the set of seeds. Then we have the lower bound 


max{rj, d* — z -|- 1} < dj' 


(34) 


This leads us to a lower bound for p that depends only on N and information contained in 
d and Gr. 


max{ri, di-i + l] 
nN - 


< p 


(36) 
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Figure 3: Degrees of recruited subjects in the St. Petersburg study of PWID. The mean re¬ 
ported degree is 10.26, with SD 8.5. One subject reported degree 200. The linear regression 
line, with slightly positive slope, is overlaid. 


Let pio denote this lower bound. One strategy for prior elicitation is to restrict the prior 
distribution of p so that Pr(p < pio) is small. We therefore fix a and find f3 so that 
Pr(p > pio|a,/?) = 0.99. 


12 Results of SS-size method on the St. Petersburg 
dataset 


Table 1^ shows the estimated number of PWID in St. Petersburg using the SS-size method 


implemented in the “size” package (Handcock, Gile & Mar 2014, Handcock et ah 2015). 
Table shows the results of regression analyses to determine whether the reported degrees 
in the St. Petersburg data decrease over time as the sample accrues. Figure shows the 
reported degrees. 


30 

















Prior Parameters Estimates Implied Prevalence 


n/N 


Max N 

Size 

Mean 

2.5% 

97.5% 

20-45yrs 

All 

Beta (7 = 

1 ) 

200000 

raw 

2744 

2209 

3206 

0.18% 

0.06% 

Beta (7 = 

5 ) 

200000 

raw 

2750 

2209 

3206 

0.18% 

0.06% 

Beta (7 = 

10 ) 

200000 

raw 

2733 

2209 

3206 

0.18% 

0.06% 

Beta (7 = 

1 ) 

500000 

raw 

2072 

1812 

2312 

0.14% 

0.04% 

Beta (7 = 

5 ) 

500000 

raw 

2064 

1812 

2312 

0.14% 

0.04% 

Beta (7 = 

10 ) 

500000 

raw 

2058 

1812 

2312 

0.14% 

0.04% 

Beta (7 = 

1 ) 

200000 

imputed 

41948 

12178 

98911 

2.80% 

0 . 86 % 

Beta (7 = 

5 ) 

200000 

imputed 

43392 

13574 

97515 

2.89% 

0.89% 

Beta (7 = 

10 ) 

200000 

imputed 

46464 

12178 

99509 

3.10% 

0.95% 

Beta (7 = 

1 ) 

500000 

imputed 

28005 

7309 

62274 

1.87% 

0.59% 

Beta (7 = 

5 ) 

500000 

imputed 

22315 

6310 

50782 

1.49% 

0.46% 

Beta (7 = 

10 ) 

500000 

imputed 

27096 

7309 

61775 

1.81% 

0.55% 

Flat (7 = 

1 ) 

200000 

raw 

1436 

1212 

1611 

0 . 10 % 

0.03% 

Flat (7 = 

5 ) 

200000 

raw 

1433 

1212 

1611 

0 . 10 % 

0.03% 

Flat (7 = 

10 ) 

200000 

raw 

1432 

1212 

1611 

0 . 10 % 

0.03% 

Flat (7 = 

1 ) 

500000 

raw 

1350 

1313 

1812 

0.09% 

0.03% 

Flat (7 = 

5 ) 

500000 

raw 

1354 

1313 

1812 

0.09% 

0.03% 

Flat (7 = 

10 ) 

500000 

raw 

1351 

1313 

1812 

0.09% 

0.03% 

Flat (7 = 

1 ) 

200000 

imputed 

27351 

2807 

87945 

1.82% 

0.56% 

Flat (7 = 

5 ) 

200000 

imputed 

31822 

2807 

105690 

2 . 12 % 

0.65% 

Flat (7 = 

10 ) 

200000 

imputed 

40331 

2807 

126626 

2.69% 

0.82% 

Flat (7 = 

1 ) 

500000 

imputed 

26440 

2812 

88258 

1.76% 

0.54% 

Flat (7 = 

5 ) 

500000 

imputed 

38355 

3311 

128733 

2.56% 

0.78% 

Flat (7 = 

10 ) 

500000 

imputed 

90623 

4311 

295628 

6.04% 

1.85% 


Table 3: Estimates from the “size” software of the number of people who inject drugs in 
St. Petersburg, Russia. We obtained posterior estimates under the flat (uniform) prior 
and Beta prior for the sample proportion n/N. The Conway-Maxwell-Poisson (CMP) 
distribution is the prior for the population degree distribution f{d\ri). We obtain results 
under two values for the maximum possible N\ 200,000 and 500,000. We set the prior 
mean of N to 83118 and the prior standard deviation to 7 x 5799 where 7 > 1, based on 
the estimate by ( Heimer fc Whit^|2010 ). By increasing 7 to 5, 10, and 20, we obtain priors 
for N with greater variance. We set the mean, standard deviation, and maximum of the 
degree distribution equal to their sample counterparts. 
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All degrees Excluding d = 200 


Method 

Slope 

SE 

p-value 

Slope 

SE 

p-value 

Linear 

9.2 X 10"^ 

1.3 X 10-5 

0.47 

8.9 X 10-^ 

7.9 X 10-^ 

0.26 

Poisson 

9.0 X 10-5 

4.7 X 10-5 

0.05 

8.9 X 10-5 

4.7 X 10-5 

0.06 

M (Huber) 

1.2 X 10-5 

6.7 X 10-^ 


1.2 X 10-5 

6.7 X 10-^ 


M (Bisquare) 

2.3 X 10-5 

6.7 X 10-^ 


1.3 X 10-5 

6.7 X 10-^ 



Table 4: Regression results for the slope of the time-ordered sample of degrees in the 
St. Petersburg data. The SS method of Handcock, Gile & Mar (2014) and Handcock 


et al.| (2015) assumes that the average degree of recruited subjects decreases as the sample 


accrues. We £t linear, Poisson, and M estimates with Huber and bisquare weighting for 
the full set of degrees, and with one outlier {d = 200) removed. Estimated slope for the 
regression line is always positive, indicating that degrees appear to increase in this sample. 
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